@AndyAyersMS
Member

Describe the calling convention used by R2R (and perhaps someday jitted) Wasm code.

Copilot AI review requested due to automatic review settings January 7, 2026 20:07
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 7, 2026
@AndyAyersMS
Member Author

PTAL @dotnet/wasm-contrib... this is a first draft so please comment / help fix.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive documentation for the WebAssembly calling convention used by R2R (Ready-to-Run) and JIT-compiled managed code in the CLR. The documentation describes how the runtime interoperates with WebAssembly's stack model, calling sequences, and garbage collection integration.

Key Changes:

  • Adds a new "Web Assembly ABI (R2R and JIT)" section to the CLR ABI documentation
  • Documents stack layout, argument passing conventions, prolog/epilog behavior, and calling sequences
  • Explains GC reference handling at call sites and the portable entry point mechanism


The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed.

It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames.
Member

Frame descriptor is GCInfo, yes?

Member Author

It will also refer to the EH info, so a bit more general.

Member

@BrzVlad Jan 8, 2026

Don't we need the stack frames to be linked together in order to support walking managed frames? So, from the sp value in the current frame, shouldn't we be able to obtain the sp of the parent frame in order to get the descriptor information, etc.? I guess the plan would be to fetch the previous sp from sp[-1]? Would there be methods where sp can be dynamically incremented, in which case this wouldn't work?

Member Author

Given the last frame's base address, the plan is that the frames are self-descriptive.

To get the base address (using the scheme above where sp can diverge from $__stack_pointer), we rely on the fact that stack walks can only happen once the R2R code has called back into native code (either helper methods or the interpreter). These calls are passed sp as an argument and save it to the global $__stack_pointer, and perhaps to some other global or similar location for easy access by the unwinder.

The frame descriptor will be at a known offset from this saved sp (likely 0), and the size of the frame will be stored in the descriptor, so the external code can compute the address of the parent frame that way, e.g. parent_sp = sp + sp[0].frameSize.

For dynamically sized frames, a copy of the prior sp can likewise be stored at some other known offset from sp to provide the necessary chaining. If the frame grows, this value can be re-established to reflect the new size. Or we can equivalently store the total frame size.

If we follow Katelyn's proposal of keeping $__stack_pointer in sync for all managed methods then there's a bit less ceremony required, but from there the unwinding proceeds the same way.
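
A minimal sketch of the unwinding arithmetic described in this reply, written as a plain C model of linear memory rather than real runtime code. The `FrameDescriptor` layout, the `method_id` field, and the function names are hypothetical; the only parts taken from the discussion are that the descriptor sits at offset 0 from the saved sp and that the parent frame is found by adding the stored frame size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor stored at offset 0 of every managed frame. */
typedef struct FrameDescriptor {
    uint32_t frame_size; /* distance in bytes to the parent frame's base */
    uint32_t method_id;  /* stand-in for the reference to GC and EH info */
} FrameDescriptor;

/* The linear stack grows down, so the parent lives at a higher address. */
uint32_t parent_sp(const uint8_t *memory, uint32_t sp)
{
    FrameDescriptor d;
    memcpy(&d, memory + sp, sizeof d);
    return sp + d.frame_size;
}

/* Walk from the sp saved at the managed->native transition up to the
   address where the managed frames end; returns the number of frames. */
size_t walk_frames(const uint8_t *memory, uint32_t sp, uint32_t stack_base)
{
    size_t count = 0;
    assert(sp <= stack_base);
    while (sp < stack_base) {
        uint32_t next = parent_sp(memory, sp);
        if (next <= sp)
            break; /* zero-sized or wrapped frame: stop instead of looping */
        sp = next;
        count++;
    }
    return count;
}
```

For dynamically sized frames, the same loop would read a saved prior sp (or an updated total size) from its known offset instead of a fixed `frame_size`.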


## Incoming argument ABI

The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation.
Member

This means helper calls all become managed->native boundaries that require a stack pointer update at start and end, right? Is that a problem?

Member Author

There is a code size cost for each helper call... it could be amortized with a custom wrapper for each helper that just does the $__stack_pointer maintenance.
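
To make the wrapper idea concrete, here is a rough C model of the lazy-sync convention. It is only a sketch: `g_stack_pointer` stands in for the Wasm global `$__stack_pointer`, the helper, wrapper, and frame sizes are invented for illustration, and whether the wrapper also restores the global on return is left open here.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the Wasm global $__stack_pointer. */
uint32_t g_stack_pointer;

/* Hypothetical native helper: it only knows about the global and carves
   its own 16-byte frame below it, the way clang-generated code would. */
uint32_t helper_native(void)
{
    uint32_t frame = g_stack_pointer - 16;
    return frame;
}

/* Wrapper with the managed convention: sp arrives as the first argument,
   is published to the global once here, and the call is forwarded. */
uint32_t helper(uint32_t sp)
{
    g_stack_pointer = sp;
    return helper_native();
}

/* Model of a managed method: the prolog adjusts only its local sp, and
   the helper call site pays nothing beyond passing sp along. */
uint32_t managed_method(uint32_t sp)
{
    enum { FRAMESIZE = 32 };
    assert(sp >= FRAMESIZE);
    sp -= FRAMESIZE;   /* prolog: the global is not touched */
    return helper(sp); /* the wrapper performs the global update */
}
```

The point of the model is that the global update lives in one wrapper per helper rather than at every call site, and the native frame lands below the managed caller's frame.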


## Outgoing call ABI

For direct managed calls, Wasm uses the Portable Entry Point feature to facilitate smooth interop with interpreted code. This means all managed calls are made indirectly, and the portable entry point is also passed as the last argument.
Member

In the long run will we optimize this for cases where we know both the caller and callee were crossgen'd? I'm fine with not specifying that yet though.

Member Author

Yes we can optimize that case. Though in general there is no guarantee the runtime will use R2R compiled callee code.

On Wasm this may be less of an issue because the cases where R2R method bodies end up being disqualified may not be possible.

Member

@jkotas Jan 8, 2026

> Yes we can optimize that case

Yes.

> Though in general there is no guarantee the runtime will use R2R compiled callee code.

The fixups for the caller would have to verify that the directly called method is going to use R2R too. If the fixup fails, R2R code for the caller would have to be rejected as well.

(We do something similar for ReJIT. If there is a ReJIT request for a method that got inlined, all methods that inlined it must be invalidated as well.)

Member Author

I added some text describing the extra validation needed if one R2R method directly calls another.

@kg
Member

kg commented Jan 7, 2026

This is probably a good time to raise that I think we shouldn't pass the stack pointer in an argument.
Reasons for it:

  • get_local 0 is 6 bytes when linking because the linker relocation needs to be 5 bytes so it can be patched by the linker, IIRC. (We're not linking, though!)
  • Loading a local might be faster than loading a global in wasm (I don't know how to verify this though)

Reasons against it:

  • Code size goes up because we need to copy the stack pointer into and out of the global at very many locations
  • More room for bugs caused by the stack pointer getting out of sync with the local
  • The extra argument makes it more likely that actual arguments won't occupy argument registers once our wasm is jitted/aot'd

Given that we're not linking I view the sp argument thing as a Potential Optimization and I feel like it's premature. Single or Yowl may have a good reason why we should still do it though based on their experiences.

I believe what we would do instead is export the stack_pointer from the runtime's wasm module (this may already be the default behavior for emscripten, to enable dynamic linking?) then grab the WebAssembly.Global for the stack_pointer and import it into every r2r module we load. Then all our code can manipulate the linear stack the same way clang generated code does.

@AndyAyersMS
Member Author

> we shouldn't pass the stack pointer in an argument.

Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

@SingleAccretion
Contributor

> Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

It doesn't seem necessary to optimize for the stub case, since stubs will be a very small proportion of the overall code and every managed callsite would need to be at least 2 bytes bigger (for global.set). That is in addition to the inflexibility of hardcoding global indices.

@yowl
Contributor

yowl commented Jan 8, 2026

How will a global.set work when threads are a thing?

@kg
Member

kg commented Jan 8, 2026

> How will a global.set work when threads are a thing?

Aren't globals functionally thread-local in wasm? Is it different for wasi?

@SingleAccretion
Contributor

SingleAccretion commented Jan 8, 2026

> Aren't globals functionally thread-local in wasm? Is it different for wasi?

I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.

Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.

Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

Another edit: more - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2024/THREADS-07-09.md.

@kg
Member

kg commented Jan 8, 2026

> > Aren't globals functionally thread-local in wasm? Is it different for wasi?
>
> I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.
>
> Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.
>
> Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

This also appears to mention that importing a global turns it into two loads instead of one, which is a bit of a problem. But I think we have to import sp no matter what and it's just a question of how often we're going to touch it.

@pavelsavara pavelsavara added the arch-wasm WebAssembly architecture label Jan 8, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to 'arch-wasm': @lewing, @pavelsavara
See info in area-owners.md if you want to be subscribed.


## Incoming argument ABI

The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation.
Member

For PInvokes, I expect that this update will be done around the callsite in the managed code.

Where do you expect it to be done for FCalls? We assume that FCalls have the same managed calling convention. Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?

If we are not able to do that, I guess we will need to create some sort of FCalls wrappers. It is doable, but it is not pretty - we have been there in the past.

For reference, what does native AOT / LLVM do for FCalls currently?

Member Author

For NAOT/LLVM it looks like FCalls have an extra initial arg that they ignore:

https://github.com/dotnet/runtimelab/blob/7706cd182716062d4fa550e88abd004e1a82dcd5/src/coreclr/nativeaot/Runtime/MathHelpers.cpp#L12

I don't see where the stack pointer global is updated; maybe NAOT/LLVM doesn't need this?

Contributor

> For reference, what does native AOT / LLVM do for FCalls currently?

NAOT-LLVM shadow stack is allocated separately from the __stack_pointer stack, so we only need to track it for the purposes of transition frames with virtual unwinding (another way of putting it is that it is only used for managed code).

> Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?

I don't know if it is possible with __attribute__((naked)) trickery to do it all in one function, but it is definitely possible with __asm to insert a stub with the managed calling convention (that'd do global.set __stack_pointer) into FCIMPL.


The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed.

It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames.
Member

> For methods with EH or GC

It is not clear what a "method with GC" means. Should this say for methods with calls or EH (ie non-leaf methods)?

Member Author

Yes, methods with GC safe points (which will always be at calls) or EH.

Initially the cell will contain code to determine if the target method has R2R code or must be interpreted. If there is R2R code for the method it is fixed up as needed. Once the target is resolved the cell can be updated to just refer to the R2R code directly, if there is any, or to a thunk for invoking the interpreter.

For indirect managed calls the sequence is similar, but the portable entry point is obtained by calling a resolve helper:
Member

I think this should say "virtual managed calls". Indirect calls (calli) should get the portable-entry-point to call from IL stack, no need to call resolve helper.

Also, there is a potential optimization for vtable-based virtual calls to just fetch the entrypoint by indexing into vtable like we do everywhere else.
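
The cell behavior in the quoted text (start out pointing at resolution code, then get patched to either the R2R body or an interpreter thunk) can be modeled in C roughly as follows. This is an illustrative sketch only: the `EntryPoint` layout, the function names, and the stand-in interpreter are all hypothetical, and the real fixup work is reduced to a null check.

```c
#include <assert.h>
#include <stddef.h>

typedef struct EntryPoint EntryPoint;

/* Managed signature in this model: sp first, entry point passed last. */
typedef int (*ManagedCode)(unsigned sp, int arg, EntryPoint *pep);

struct EntryPoint {
    ManagedCode code;     /* what call sites invoke indirectly */
    ManagedCode r2r_body; /* precompiled body, or NULL if none is usable */
    int method_token;     /* hypothetical id the interpreter would use */
};

/* Stand-in for a thunk that hands the call to the interpreter. */
int interpreter_thunk(unsigned sp, int arg, EntryPoint *pep)
{
    (void)sp;
    return pep->method_token * 1000 + arg;
}

/* Initial cell contents: decide once, patch the cell, then forward. */
int resolve_stub(unsigned sp, int arg, EntryPoint *pep)
{
    assert(pep != NULL);
    pep->code = pep->r2r_body ? pep->r2r_body : interpreter_thunk;
    return pep->code(sp, arg, pep);
}

/* Every managed call goes through the cell. */
int call_managed(EntryPoint *pep, unsigned sp, int arg)
{
    return pep->code(sp, arg, pep);
}

/* Example R2R body used only for illustration. */
int add_one_r2r(unsigned sp, int arg, EntryPoint *pep)
{
    (void)sp;
    (void)pep;
    return arg + 1;
}
```

After the first call the cell no longer points at `resolve_stub`, so later calls pay only for the indirect call itself.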

@AndyAyersMS
Member Author

Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

Global SP always in sync

;; PROLOG

global.get $__stack_pointer    ;; (2, if can get a small index), else 6.
i32.const FRAMESIZE            ;; (2 typically)
i32.sub                        ;; 1
dup                            ;; 1
global.set $__stack_pointer    ;; (2/6)
local.set sp                   ;; 2

;; EPILOG

local.get sp                   ;; 2
i32.const FRAMESIZE            ;; 2ish
i32.add                        ;; 1
global.set $__stack_pointer    ;; 2/6

So 10/18 bytes per prolog, 7/11 bytes per epilog

No overhead at call sites. Smaller signatures.

Global SP lazy sync at boundaries

;; PROLOG

local.get sp                  ;; 2
i32.const FRAMESIZE           ;; 2
i32.sub                       ;; 1
local.set sp                  ;; 2

;; EPILOG

(empty)

So 7 bytes per prolog, 0 per epilog

;; unmanaged call sites & fcalls

local.get  sp                 ;; 2
global.set $__stack_pointer   ;; 2/6   (~0 amortizable for fcalls)

;; managed call sites

local.get  sp                 ;; 2

For the current x86 crossgen SPMI collection (which may not represent the set of methods we care about) there are 273829 methods / prologs, 312716 epilogs, 1504289 managed call sites, and 163641 helper call sites.

So assuming the optimistic case where we can encode the global SP index in one byte and wrap FCalls with a global SP update (size assumed negligible) I get size estimates like:

global SP   in sync: 5.581M bytes  (~20 bytes/method)
global SP lazy sync: 4.925M bytes  (~18 bytes/method)

If we can't get a small index for the SP global then the "sync" cost rises to 8.3M (~30 bytes/method).

This may overstate the difference somewhat. For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.
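
The estimate is a weighted sum of site counts and per-site byte costs, which can be sketched as below. The struct and function names are invented for illustration. The site counts are the SPMI figures quoted above, and the lazy-sync costs assume the optimistic case where helper call sites add nothing (wrappers absorb the global update); under that assumption the total comes to 4,925,381 bytes, in line with the 4.925M figure.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t prologs, epilogs, managed_calls, helper_calls;
} SiteCounts;

typedef struct {
    uint64_t prolog, epilog, managed_call, helper_call;
} BytesPerSite;

/* Total SP-maintenance bytes for one convention. */
uint64_t estimate_bytes(SiteCounts n, BytesPerSite b)
{
    return n.prologs * b.prolog + n.epilogs * b.epilog
         + n.managed_calls * b.managed_call
         + n.helper_calls * b.helper_call;
}

/* Average overhead per method, using the prolog count as method count. */
double bytes_per_method(SiteCounts n, BytesPerSite b)
{
    assert(n.prologs > 0);
    return (double)estimate_bytes(n, b) / (double)n.prologs;
}

/* Counts quoted above from the x86 crossgen SPMI collection. */
const SiteCounts SPMI_X86 = { 273829, 312716, 1504289, 163641 };

/* Lazy sync: 7-byte prolog, empty epilog, 2 bytes per managed call site;
   helper call sites are assumed free here (an assumption of this sketch). */
const BytesPerSite LAZY_SYNC = { 7, 0, 2, 0 };
```

Other conventions can be compared by filling in a different `BytesPerSite`; which sites get charged to which scheme is a modeling choice, so totals shift with those assumptions.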

@SingleAccretion
Contributor

SingleAccretion commented Jan 9, 2026

> Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

This matches my calculations based on NAOT-LLVM data from last year almost perfectly: "50%" code size for the large index encoding, about the same for the small encoding.

> For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.

The proportion of truly leaf methods is going to be rather low. It was on the order of < 5% on real-world NAOT-LLVM data (this is from dotnet/runtimelab#2697). This is because most methods may throw (an NRE), requiring a helper call.

@jkotas
Member

jkotas commented Jan 9, 2026

> This is because most methods may throw (an NRE), requiring a helper call.

Was this with the assumption that `this` can be null? It is a gray area. `this` can never be null in C#. I think it would be OK to assume that `this` is never null for wasm.

@SingleAccretion
Contributor

> Was this with assumption that this can be null?

Yes. I can re-measure how this works out with the non-null-`this` assumption. Though I personally wouldn't support such a thing "for WASM only". And it would be a breaking change for structs (`((S*)null)->Method();`, which works today, would become UB).

@jkotas
Member

jkotas commented Jan 9, 2026

I believe we have some non-deterministic behaviors around a null `this` pointer for reference types today. I would not feel bad about more UB there by default. It is impossible to end up with a null `this` for reference types in C#.

> it would be a breaking change for structs (((S*)null)->Method();, which works today

I guess we can keep them for structs.

@SingleAccretion
Contributor

> Single or Yowl may have a good reason why we should still do it though based on their experiences.

I'll write down my thinking on this question. I'll use $sp to represent the argument scheme and $__stack_pointer to represent the global.set scheme.

Code Size

As per Andy's study above (and my earlier investigation as well), the size impact is "weakly in favor" of using $sp. I've also looked into Jan's suggestion above about non-null TYP_REF this, and while it does increase the number of leafs (5% -> 8% for my Avalonia sample app), that is not significant enough. That said, I have not investigated what the numbers look like if we allow tail-calling the NRE helper (thus discarding the need for a frame), since this will degrade diagnostics and is not a parity experience with other platforms. I can look into that if needed [but it's less trivial than the experiments so far].

"Weakly in favor" because:

  • We have neutral impact for scenarios where we can hardcode the __stack_pointer index, i.e. traditional dynamically loaded R2R.
  • We have positive (significant) impact for scenarios where we can't hardcode __stack_pointer and aren't able to use "shrinking". This is the NAOT scenario, or a potential "fused" R2R scheme where we could combine native code directly with R2R code.

Throughput

This point is clearly in favor of $sp. As we also discussed above, an imported $__stack_pointer is at least two indirections, potentially (significantly?) more with multi-threading in the picture. Since $sp is going to be one of the hottest values in the majority of methods, it will also on balance be a good tradeoff to "use up" a register argument slot for it.

Runtime complexity & interop

This point is clearly in favor of $__stack_pointer, since:

  1. We won't need to deal with a class of bugs arising from "SP desync". I don't know how big of an issue that is. My speculation/feeling is that it shouldn't be a big problem.
  2. Simpler FCall implementation. I would however note that depending on whether we can find a clang builtin for "get caller's SP", we would still need manual assembly for those FCalls that need to be stackwalking roots (allocation helpers, potentially EH?).
  3. The managed<->native transitions get a bit faster and smaller (I don't think it is too important to optimize them though).

Based on the above I would personally be weakly in favor of $sp as the convention better reflecting the reality of managed code where most methods will need to use the stack, as opposed to native code where $__stack_pointer is only needed for address-taken structures / local arrays (recall the default WASM stack size is 65K).

@AndyAyersMS
Member Author

From what I can tell there is no way to have LLVM inline assembly insert instructions before the prolog, so something like

int add(int* sp, int b) {
    __asm("local.get 0\n global.set __stack_pointer\n");
   return b; 
}

generates (without opts) code like this per compiler explorer

add(int*, int):
        global.get      __stack_pointer
        i32.const       16
        i32.sub 
        local.set       2
        local.get       2
        local.get       0
        i32.store       12
        local.get       2
        local.get       1
        i32.store       8

        ;; inline asm payload
        local.get       0
        global.set      __stack_pointer

        local.get       2
        i32.load        8
        return
        end_function

So it seems that, with the pure lazy $sp approach, helper call wrappers for those helpers implemented in native code must be created in Wasm directly. With wrappers we could still add the extra ignored initial arg to the C++ helpers when building for wasm, which avoids arg shuffling (assuming we don't have any wrapped helpers that return structs), though we'd have to adapt existing callers. We could make $sp the last argument, but this is less efficient for managed callees.

If we do the lazy sync and can't or don't want to do wrappers, we can reduce the sync cost a bit further by only setting $__stack_pointer in the prolog of methods that make helper calls. There are 0.6 helper calls per method but the fraction of methods with helper call sites is around 0.2.

@pavelsavara
Member

pavelsavara commented Jan 12, 2026

> From what I can tell there is no way to have LLVM inline assembly insert instructions before the prolog, so something like

Do I understand correctly that we only need this for NativeAOT-LLVM?

Maybe it would be OK to create a wrapper function that does the global.set __stack_pointer and calls the real thing, then rely on wasm-opt --one-caller-inline-max-function-size to inline it.

https://manpages.debian.org/testing/binaryen/wasm-opt.1.en.html#one

@pavelsavara
Member

Here is how we create direct wasm with LLVM
https://github.com/dotnet/runtime/blob/main/src/mono/mono/utils/mono-threads-wasm.S

@SingleAccretion
Contributor

I don't think we need any fancy tooling to create these wrappers. All fcalls are implemented with macros that already have most of the information we need - it will need to be augmented with the ABI types (alternatively, you can play games with compile-time string concatenation using constexpr). Concept:

// Usage
#define FCIMPL_VOID_I(foo, void *p)

// Rough definition
#define FCIMPL_VOID_I(funcname, a1) \
__asm(
  .functype funcname (i32, i32) -> ()
  .global funcname
funcname:
  local.get 0
  global.set __stack_pointer
  local.get 1
  call funcname#_native
  end_function
);

void F_CALL_CONV funcname##_native(a1) { FCIMPL_PROLOG(funcname)

@AndyAyersMS
Member Author

Given the above, I propose that we go with the lazy $sp approach for now.

For the managed wrappers outlined above, do we need to be careful not to mess things up for the interpreter?

@jkotas
Member

jkotas commented Jan 14, 2026

> For the managed wrappers outlined above, do we need to be careful not to mess things up for the interpreter?

It should just work. The interpreter will call FCalls like any other method with native code.
